AITopics

Country:

Asia (0.46)
North America (0.28)

Genre: Research Report (1.00)

Industry:

Law (0.67)
Information Technology (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Neural Information Processing SystemsApr-30-2026, 07:24:13 GMT

VisAlign: Dataset for Measuring the Alignment between AI and Humans in Visual Perception

AI alignment refers to models acting towards human-intended goals, preferences, or ethical principles. In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment. Specifically, we propose a new dataset for measuring AI-human visual alignment in terms of image classification. In order to evaluate AI-human visual alignment, a dataset should encompass samples with various scenarios and have gold human perception labels. Our dataset consists of three groups of samples, namely Must-Act (i.e., Must-Classify), Must-Abstain, and Uncertain, and further divided into eight categories. All samples have a gold human perception label; even Uncertain (e.g., severely blurry) sample labels were obtained via crowd-sourcing. The validity of our dataset is verified by sampling theory, statistical theories related to survey design, and experts in the related fields. Using our dataset, we analyze the visual alignment and reliability of five popular visual perception models and eight abstention methods.

artificial intelligence, machine learning, natural language, (19 more...)

Country:

Asia (0.45)
North America > United States (0.27)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.92)
Law (0.92)
Government (0.67)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.92)
(2 more...)

Neural Information Processing SystemsFeb-17-2026, 23:01:14 GMT

Appendix A Datasheet for Datasets

The negative prompt used is " unrealistic, bad anatomy, wrong anatomy, extra limb, missing limb,

artificial intelligence, dataset, machine learning, (15 more...)

Country:

South America > Brazil (0.04)
North America > United States (0.04)
North America > Canada (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry:

Information Technology (0.67)
Law (0.67)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Neural Information Processing SystemsFeb-17-2026, 23:01:11 GMT

f37aba0f53fdb59f53254fe9098b2177-Paper-Datasets_and_Benchmarks.pdf

artificial intelligence, machine learning, natural language, (18 more...)

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > India (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Law (0.92)
Government (0.67)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Das, Soumojit, Dasgupta, Nairanjana, Dutta, Prashanta

Geometric Calibration and Neutral Zones for Uncertainty-Aware Multi-Class Classification

arXiv.org Machine LearningDec-2-2025

Modern artificial intelligence systems make critical decisions yet often fail silently when uncertain -- even well-calibrated models provide no mechanism to identify \textit{which specific predictions} are unreliable. We develop a geometric framework addressing both calibration and instance-level uncertainty quantification for neural network probability outputs. Treating probability vectors as points on the $(c-1)$-dimensional probability simplex equipped with the Fisher--Rao metric, we construct: (i) Additive Log-Ratio (ALR) calibration maps that reduce exactly to Platt scaling for binary problems while extending naturally to multi-class settings, and (ii) geometric reliability scores that translate calibrated probabilities into actionable uncertainty measures, enabling principled deferral of ambiguous predictions to human review. Theoretical contributions include: consistency of the calibration estimator at rate $O_p(n^{-1/2})$ via M-estimation theory (Theorem~1), and tight concentration bounds for reliability scores with explicit sub-Gaussian parameters enabling sample size calculations for validation set design (Theorem~2). We conjecture Neyman--Pearson optimality of our neutral zone construction based on connections to Bhattacharyya coefficients. Empirical validation on Adeno-Associated Virus classification demonstrates that the two-stage framework captures 72.5\% of errors while deferring 34.5\% of samples, reducing automated decision error rates from 16.8\% to 6.9\%. Notably, calibration alone yields marginal accuracy gains; the operational benefit arises primarily from the reliability scoring mechanism, which applies to any well-calibrated probability output. This work bridges information geometry and statistical learning, offering formal guarantees for uncertainty-aware classification in applications requiring rigorous validation.

calibration, prediction, reliability score, (15 more...)

arXiv.org Machine Learning

2511.2096

Country:

North America > United States > Washington > Whitman County > Pullman (0.04)
North America > United States > New York (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.67)
Research Report > New Finding (0.46)

Industry:

Health & Medicine > Diagnostic Medicine (0.68)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)

Ronco, Michele, Delforge, Damien, Jäger, Wiebke S., Corbane, Christina

Subnational Geocoding of Global Disasters Using Large Language Models

arXiv.org Artificial IntelligenceNov-20-2025

Subnational location data of disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations in unstructured textual form, with inconsistent granularity or spelling, that make it difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o, and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.

large language model, machine learning, natural language, (22 more...)

2511.14788

Country:

Europe (1.00)
North America > United States (0.46)
Asia > India (0.28)

Genre:

Workflow (0.88)
Research Report (0.64)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial IntelligenceNov-12-2025

A Decentralized Retrieval Augmented Generation System with Source Reliabilities Secured on Blockchain

Lu, Yining, Tang, Wenyi, Johnson, Max, Jung, Taeho, Jiang, Meng

Existing retrieval-augmented generation (RAG) systems typically use a centralized architecture, causing a high cost of data collection, integration, and management, as well as privacy concerns. There is a great need for a decentralized RAG system that enables foundation models to utilize information directly from data owners who maintain full control over their sources. However, decentralization brings a challenge: the numerous independent data sources vary significantly in reliability, which can diminish retrieval accuracy and response quality. To address this, our decentralized RAG system has a novel reliability scoring mechanism that dynamically evaluates each source based on the quality of responses it contributes to generate and prioritizes high-quality sources during retrieval. To ensure transparency and trust, the scoring process is securely managed through blockchain-based smart contracts, creating verifiable and tamper-proof reliability records without relying on a central authority. We evaluate our decentralized system with two Llama models (3B and 8B) in two simulated environments where six data sources have different levels of reliability. Our system achieves a +10.7\% performance improvement over its centralized counterpart in the real world-like unreliable data environments. Notably, it approaches the upper-bound performance of centralized systems under ideally reliable data environments. The decentralized infrastructure enables secure and trustworthy scoring management, achieving approximately 56\% marginal cost savings through batched update operations. Our code and system are open-sourced at github.com/yining610/Reliable-dRAG.

large language model, machine learning, natural language, (15 more...)

2511.07577

Country:

North America > United States (0.46)
Europe > Austria (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

arXiv.org Artificial IntelligenceOct-29-2025

M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems

Sun, Mengzhou, Zhao, Sendong, Chen, Jianyu, Wang, Haochun, Qin, Bin

Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems. They generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This method is inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an improvement of up to 23.31% accuracy across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes the applications of LLMs more reliable and reduces diagnostic errors.

large language model, machine learning, natural language, (17 more...)

2510.23995

Country: Asia > China (0.16)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Diagnostic Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine LearningOct-21-2025

Data Reliability Scoring

Chen, Yiling, Feng, Shi, Kattuman, Paul, Yu, Fang-Yi

How can we assess the reliability of a dataset without access to ground truth? We introduce the problem of reliability scoring for datasets collected from potentially strategic sources. The true data are unobserved, but we see outcomes of an unknown statistical experiment that depends on them. To benchmark reliability, we define ground-truth-based orderings that capture how much reported data deviate from the truth. We then propose the Gram determinant score, which measures the volume spanned by vectors describing the empirical distribution of the observed data and experiment outcomes. We show that this score preserves several ground-truth based reliability orderings and, uniquely up to scaling, yields the same reliability ranking of datasets regardless of the experiment -- a property we term experiment agnosticism. Experiments on synthetic noise models, CIFAR-10 embeddings, and real employment data demonstrate that the Gram determinant score effectively captures data quality across diverse observation processes.

data mining, gram determinant score, machine learning, (16 more...)

arXiv.org Machine Learning

2510.17085

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Banking & Finance (1.00)
Government > Tax (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.67)

arXiv.org Artificial IntelligenceOct-3-2025

Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

Yoo, Seungju, Kwon, Hyuk, Hwang, Joong-Won, Lee, Kibok

Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at https://github.com/YonseiML/autoeval-det.

artificial intelligence, detection, machine learning, (18 more...)

2508.12082

Country: Europe > Switzerland (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)